Pandas Data Structures


In [1]:
import numpy as np
import pandas as pd

Understanding a language's data structures is essential for a good programming experience: a poor grasp of them leads to code that is inefficient and hard to read.

These notes walk through all the Pandas data structures and provide further readings for deeper study. Prior knowledge of NumPy data structures is assumed.

The covered Pandas data structures are:

  • Series
  • DataFrames

Series

A Series is a one-dimensional structure that can hold any data type (booleans, integers, floats or even arbitrary Python objects).

Series Creation

We can create Series from:

  • Python dictionaries
  • Python List
  • NumPy ndarrays
  • a scalar value

As an optional parameter, we can provide an index: a list of axis labels. The index provides an effective way to access data (e.g. by slicing).

Each Series has a dtype that corresponds to the type of its values.
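
As a quick sketch of what the index gives us (labels behave like dictionary keys, and label slicing, unlike positional slicing, includes both endpoints):

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

# label access works like a dictionary lookup
print(s['b'])       # 2

# label slicing includes BOTH endpoints, unlike positional slicing
print(s['a':'b'])
```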

From dictionary:

In this case, if no index is provided, the index is extracted from the dictionary keys, while the data is extracted from the values.


In [2]:
d = {'a':5.,'b':5.,'c':5.}
i = ['x','y','z']
s1 = pd.Series(d)
print(s1)
s1.index


a    5.0
b    5.0
c    5.0
dtype: float64
Out[2]:
Index(['a', 'b', 'c'], dtype='object')

Otherwise, if an index is passed, the values whose keys appear in the index are pulled out and the remaining positions are assigned NaN. A NaN value marks missing data, and we will have to deal with such values later on.

What is NaN? (from the Pandas docs) Some might quibble over our usage of missing. By “missing” we simply mean null or “not present for whatever reason”. Many data sets simply arrive with missing data, either because it exists and was not collected or it never existed. For example, in a collection of financial time series, some of the time series might start on different dates. Thus, values prior to the start date would generally be marked as missing.


In [3]:
d = {'a':5,'b':5,'c':5}
i = ['x','y','a','b']
s1 = pd.Series(d, index = i)
print(s1)
print(s1.dtype)
s1.index


x    NaN
y    NaN
a    5.0
b    5.0
dtype: float64
float64
Out[3]:
Index(['x', 'y', 'a', 'b'], dtype='object')

From list:

We can create Series from lists. In this case we can provide an index if desired; however, the index must have the same length as the list.


In [4]:
l = [5,5,5]
s1 = pd.Series(l)
print(s1)
i = ['x','y','z']
s1 = pd.Series(l, index = i)
print(s1)
# This would raise a ValueError: the index length must match the data length
# i = ['x','y','a','b']
# s1 = pd.Series(l, index=i)
# print(s1)
print(s1.dtype)
s1.index


0    5
1    5
2    5
dtype: int64
x    5
y    5
z    5
dtype: int64
int64
Out[4]:
Index(['x', 'y', 'z'], dtype='object')

From numpy array:


In [5]:
s2 = pd.Series(np.array([3,20,5]),index=['a','b','c'])
print(s2)
print(s2.dtype)
s2.index


a     3
b    20
c     5
dtype: int32
int32
Out[5]:
Index(['a', 'b', 'c'], dtype='object')

From scalar:


In [6]:
s3 = pd.Series(5,index=['a','b','c'])
print(s3)
print(s3.dtype)
s3.index


a    5
b    5
c    5
dtype: int64
int64
Out[6]:
Index(['a', 'b', 'c'], dtype='object')

A Series can also have a name attribute. When a Series is extracted from a DataFrame, its name is automatically set to the column name.


In [ ]:
s3 = pd.Series(5,index=['a','b','c'], name = 'Series3')
s3.name

Series Accessing

A Series can be accessed by position (numerical index), by a boolean array, or by key (axis labels). Accessing by position is like working with NumPy ndarrays, while accessing by key (axis labels) is like working with dictionaries.

Position accessing


In [7]:
s2


Out[7]:
a     3
b    20
c     5
dtype: int32

In [8]:
s2[1]


Out[8]:
20

In [10]:
s2[:1]


Out[10]:
a    3
dtype: int32

Boolean list accessing


In [11]:
s2[[True,True,False]]


Out[11]:
a     3
b    20
dtype: int32

In [14]:
s2[s2>4]


Out[14]:
b    20
c     5
dtype: int32

Key accessing


In [15]:
s2[['a','b']]


Out[15]:
a     3
b    20
dtype: int32

In [16]:
s2['a']


Out[16]:
3

In [17]:
'a' in s2


Out[17]:
True

In case of accessing a nonexistent key, a KeyError exception is raised:


In [19]:
try:
    s2['z']
except KeyError:
    print("Error handled")


Error handled

To avoid such errors, we can use the Series get method, which returns a default value when the key is missing.


In [20]:
s2.get('x', np.nan)


Out[20]:
nan

Series Operations

Vectorized operations can be performed on pandas Series, and Series are accepted by most NumPy functions. The result of an operation between unaligned Series has the union of the indexes involved; if a label is not found in one of the Series, the corresponding result is marked as missing (NaN).


In [22]:
s2


Out[22]:
a     3
b    20
c     5
dtype: int32

In [21]:
print(s2 + 23)
# we can apply np functions

np.add(s2,23) == s2 + 23


a    26
b    43
c    28
dtype: int32
Out[21]:
a    True
b    True
c    True
dtype: bool

In [23]:
s1


Out[23]:
x    5
y    5
z    5
dtype: int64

In [24]:
s2


Out[24]:
a     3
b    20
c     5
dtype: int32

Operations are aligned on the index. Since s1 and s2 share no labels, s2 + s1 is all NaN; after reset_index() the addition is performed position-wise:


In [28]:
s2.reset_index() + s1.reset_index()


Out[28]:
  index   0
0    ax   8
1    by  25
2    cz  10

In [ ]:
(s2 + s1).dropna()

In [ ]:
s2 ** 3

In [ ]:
np.exp(s2)
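
To make the alignment rule from above explicit, here is a minimal sketch of adding two Series whose indexes only partially overlap:

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 2], index=['x', 'y'])
b = pd.Series([10, 20], index=['y', 'z'])

# the result carries the union of both indexes;
# labels present in only one operand become NaN
total = a + b
print(total)
```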

Series dtypes

The dtype can be forced at Series creation; if None, it is inferred from the data.


In [34]:
pd.Series([1,2,3,4,5],dtype=np.float32)


Out[34]:
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float32

For numerical variables, the most common dtypes are int and float.

For categorical variables, strings are the common type. However, when working with labels, more advanced categorical data management is available: http://pandas.pydata.org/pandas-docs/stable/categorical.html
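
As a minimal sketch of that categorical machinery (assuming a reasonably recent pandas version), a Series can be given the category dtype directly:

```python
import pandas as pd

# 'category' stores each distinct label once and the data as integer codes
s = pd.Series(["low", "high", "low", "high"], dtype="category")
print(s.dtype)           # category
print(s.cat.categories)  # the distinct labels
```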

Why do strings get dtype object? (from http://stackoverflow.com/questions/21018654/strings-in-a-dataframe-but-dtype-is-object)

The object dtype comes from NumPy; it describes the type of the elements in an ndarray. Every element in an ndarray must have the same size in bytes: for int64 and float64 they are 8 bytes. But the length of a string is not fixed, so instead of saving the bytes of the strings in the ndarray directly, Pandas uses an object ndarray, which saves pointers to the string objects; because of this, the dtype of such an ndarray is object.


In [35]:
pd.Series(["a","b","c"],dtype=str)


Out[35]:
0    a
1    b
2    c
dtype: object

Are dates categorical or numeric?

Days of the month, months, days of the week, etc. are considered categorical.

Specific dates, such as the day payments are received, birth dates, etc. are numeric.


In [ ]:
pd.Series(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])

In [ ]:
import datetime

base = datetime.datetime.today()
date_list = [base - datetime.timedelta(days=x) for x in range(0, 7)]
date_s = pd.Series(date_list)
print(date_s)
date_s[1] > date_s[2]

Missing Data

You can insert missing values by simply assigning to containers. The actual missing value used will be chosen based on the dtype.

For example, numeric containers will always use NaN regardless of the missing value type chosen:


In [36]:
import numpy as np

s = pd.Series(["a", "b", "c"])
s.loc[0] = None
s.loc[1] = np.nan

print(s)

s = pd.Series([1, 2, 3])
s.loc[0] = None
s.loc[1] = np.nan

print(s)


0    None
1     NaN
2       c
dtype: object
0    NaN
1    NaN
2    3.0
dtype: float64

Missing values propagate naturally through arithmetic operations between pandas objects.


In [37]:
s1 = pd.Series([1,2,3])
s2 = pd.Series([1,np.nan,3])

s1 + s2


Out[37]:
0    2.0
1    NaN
2    6.0
dtype: float64

In [38]:
s1.index


Out[38]:
RangeIndex(start=0, stop=3, step=1)

In [39]:
s2.index


Out[39]:
RangeIndex(start=0, stop=3, step=1)

The descriptive statistics and computational methods discussed in the data structure overview are all written to account for missing data. For example:

  • When summing data, NA (missing) values will be treated as zero
  • If the data are all NA, the result will be NA
  • Methods like cumsum and cumprod ignore NA values, but preserve them in the resulting arrays
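
The first and third rules can be sketched as follows (the all-NA case is omitted because its behaviour has changed across pandas versions):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 3])
print(s.sum())     # 4.0 -> the NaN is skipped, as if it were zero
print(s.cumsum())  # the NaN is ignored but preserved in the result
```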

Cleaning/filling missing data:

pandas objects are equipped with various data manipulation methods for dealing with missing data.

  • The fillna() method can “fill in” NA values with non-null data
  • With dropna() you can simply exclude labels that refer to missing data
  • isnull() returns a boolean array that selects the missing values

In [40]:
s2 = pd.Series([1,np.nan,3])
print(s2.fillna(0))
print(s2.dropna())
print(s2.isnull())


0    1.0
1    0.0
2    3.0
dtype: float64
0    1.0
2    3.0
dtype: float64
0    False
1     True
2    False
dtype: bool

In [42]:
s2[s2.isnull()]


Out[42]:
1   NaN
dtype: float64

Exercises:

Exercise 1: Load iqsize.csv using the csv library as a dictionary of Series.

Exercise 2: Check Series dtype. Are they correct? Why? What is the problem of having numerical variables as categorical?

Exercise 3: For sex variable, select those that are males (including synonyms).

Exercise 4: Count how many women and how many men there are. How many missing elements are there?

Examples with Series

Before covering Pandas' own data-loading tools, let's work through some examples building Series from raw data fetched with the standard library.


In [ ]:
import csv
from urllib.request import urlopen

url = 'https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat'
response = urlopen(url)
lines = response.read().decode('utf-8').splitlines()
data = list(csv.reader(lines))

# data is a list of lists

We can create a Series from the data variable (a list of lists) directly.


In [ ]:
pd.Series(data)

This is not a very useful Series object, since accessing the inner list items is not syntactically nice. However, let's try to put all the airports' countries in a Series.


In [ ]:
countries = pd.Series(np.array([airport[3] for airport in data]))
# this is a more interesting Series object
countries

In [ ]:
print(countries.index)

In [ ]:
## Access by number
print("-> Access by number")
print(countries[0])
## we can evaluate a function over all elements to get a boolean series
print((countries == "Spain").head())
print((countries[countries == "Spain"]).head())
print((countries[countries != "Spain"]).head())
print((countries[countries != "Spain"]).tail())
# we can get the list of countries that have (at least) one airport
print(countries.unique())

In [ ]:
# we can also access by key
countries[81]

We can see that there is a mismatch between the Series index and the airport ID stored in the csv. Let's investigate what happened using what we know about Series.


In [ ]:
csv_index = pd.Series(np.array([airport[0] for airport in data]))
print(csv_index[csv_index.astype(int) != csv_index.index + 1])

In [ ]:
# compare each airport id with its predecessor to find gaps
ids = csv_index.astype(int)
shifted = ids.shift(1).fillna(0).astype(int)
# rows whose id is not exactly the previous id + 1 are non consecutive
non_consecutive = ids[ids != shifted + 1]
print(non_consecutive.head(10))
# check if our assumption is true (at most one consecutive value is missing)
print("the assumption is:", len(non_consecutive) + len(ids) == ids.iloc[-1])
# let's see what happens: gaps larger than one missing id
difference = (shifted + 1) - ids
print(difference[difference < -1])

DataFrame

A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. It can be seen as a spreadsheet where each column is a Series, or as a Python dictionary in which Series are accessed through their labels.

DataFrame Creation

We can create DataFrames from:

  • Dict of 1D ndarrays, lists, dicts, or Series
  • 2-D numpy.ndarray
  • Structured or record ndarray
  • A Series
  • Another DataFrame

From dict of Series or dict

The result index will be the union of the indexes of the various Series. If there are any nested dicts, these will be first converted to Series. If no columns are passed, the columns will be the sorted list of dict keys.


In [43]:
d = {'one': pd.Series([1,2,3],index=['a','b','c']),
     'two': pd.Series([1,2,3,4],index=['a','b','c','z']),
     'three': {'a':1}}
df = pd.DataFrame(d)
df


Out[43]:
   one  three  two
a  1.0    1.0    1
b  2.0    NaN    2
c  3.0    NaN    3
z  NaN    NaN    4

In [44]:
pd.DataFrame(d, index=['d', 'b', 'a'])


Out[44]:
   one  three  two
d  NaN    NaN  NaN
b  2.0    NaN  2.0
a  1.0    1.0  1.0

In [45]:
pd.DataFrame(d, index=['d', 'b', 'a'], 
             columns=['two', 'three','four'])


Out[45]:
   two  three  four
d  NaN    NaN   NaN
b  2.0    NaN   NaN
a  1.0    1.0   NaN

The row and column labels can be accessed respectively by accessing the index and columns attributes:


In [46]:
df.index


Out[46]:
Index(['a', 'b', 'c', 'z'], dtype='object')

In [47]:
df.columns


Out[47]:
Index(['one', 'three', 'two'], dtype='object')

From dict of ndarrays / lists

The ndarrays must all be the same length. If an index is passed, it must clearly also be the same length as the arrays. If no index is passed, the result will be range(n), where n is the array length.


In [50]:
d = {'one' : [1., 2., 3.,4.], 'two' : [4., 3., 2., 1.]}
pd.DataFrame(d)


Out[50]:
   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0

In [52]:
pd.DataFrame(d) + pd.DataFrame(d, index = ['a','b','c','d'])


Out[52]:
   one  two
0  NaN  NaN
1  NaN  NaN
2  NaN  NaN
3  NaN  NaN
a  NaN  NaN
b  NaN  NaN
c  NaN  NaN
d  NaN  NaN

From structured or record array

This case is handled identically to a dict of arrays.


In [53]:
data = np.zeros((2,), dtype=[('A', 'i4'),('B', 'f4'),('C', 'a10')])
print(data)
data[:] = [(1,2.,'Hello'), (2,3.,"World")]
print(data)
df = pd.DataFrame(data)
print(df)
df.dtypes


[(0, 0., b'') (0, 0., b'')]
[(1, 2., b'Hello') (2, 3., b'World')]
   A    B         C
0  1  2.0  b'Hello'
1  2  3.0  b'World'
Out[53]:
A      int32
B    float32
C     object
dtype: object

In [ ]:
pd.DataFrame(data, index=['first', 'second'])

In [ ]:
pd.DataFrame(data, columns=['C', 'A', 'B'])

From a list of dicts


In [54]:
data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
pd.DataFrame(data2)


Out[54]:
   a   b     c
0  1   2   NaN
1  5  10  20.0

In [ ]:
pd.DataFrame(data2, index=['first', 'second'])

In [ ]:
pd.DataFrame(data2, columns=['a', 'b'])

From a dict of tuples

You can automatically create a multi-indexed frame by passing a dictionary whose keys are tuples:


In [55]:
pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
              ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
              ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
              ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
              ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})


Out[55]:
       a              b
       a    b    c    a     b
A B  4.0  1.0  5.0  8.0  10.0
  C  3.0  2.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0

Column selection, addition, deletion

(from Pandas Docs)

You can treat a DataFrame semantically like a dict of like-indexed Series objects. Getting, setting, and deleting columns works with the same syntax as the analogous dict operations:


In [58]:
import pandas as pd
df = pd.DataFrame({"one":[1.6,2.2,3.4,3.5],"two":[1.5,2.1,3.9,np.nan],"three":[1.2,2.80,3.80,np.nan]})
print(df)

df['four'] = df['one'] * df['two']
df['flag'] = df['one'] > 2
df


   one  three  two
0  1.6    1.2  1.5
1  2.2    2.8  2.1
2  3.4    3.8  3.9
3  3.5    NaN  NaN
Out[58]:
   one  three  two   four   flag
0  1.6    1.2  1.5   2.40  False
1  2.2    2.8  2.1   4.62   True
2  3.4    3.8  3.9  13.26   True
3  3.5    NaN  NaN    NaN   True

In [57]:
type(df['one'])


Out[57]:
pandas.core.series.Series

Columns can be deleted or popped like with a dict:


In [ ]:
del df['two']
three = df.pop('three')
df

When inserting a scalar value, it will naturally be propagated to fill the column:


In [59]:
df['foo'] = 'bar'
df


Out[59]:
   one  three  two   four   flag  foo
0  1.6    1.2  1.5   2.40  False  bar
1  2.2    2.8  2.1   4.62   True  bar
2  3.4    3.8  3.9  13.26   True  bar
3  3.5    NaN  NaN    NaN   True  bar

When inserting a Series that does not have the same index as the DataFrame, it will be conformed to the DataFrame’s index:


In [60]:
df['one_trunc'] = df['one'][:2]
df


Out[60]:
   one  three  two   four   flag  foo  one_trunc
0  1.6    1.2  1.5   2.40  False  bar        1.6
1  2.2    2.8  2.1   4.62   True  bar        2.2
2  3.4    3.8  3.9  13.26   True  bar        NaN
3  3.5    NaN  NaN    NaN   True  bar        NaN

In [61]:
df['one'][:2]


Out[61]:
0    1.6
1    2.2
Name: one, dtype: float64

You can insert raw ndarrays but their length must match the length of the DataFrame’s index.

By default, columns get inserted at the end. The insert function is available to insert at a particular location in the columns:


In [ ]:
df.insert(1, 'rand', np.random.randint(1,10,df["one"].size))
df

Indexing / Selection

The basics of indexing are as follows:


  • df[col]: Select column (returns Series)
  • df.loc[label]: Select row by label (returns Series)
  • df.iloc[loc]: Select row by integer location (returns Series)
  • df[5:10]: Slice rows (returns DataFrame)
  • df[bool_vec]: Select rows by boolean vector (returns DataFrame)

Row selection, for example, returns a Series whose index is the columns of the DataFrame:


In [62]:
df


Out[62]:
   one  three  two   four   flag  foo  one_trunc
0  1.6    1.2  1.5   2.40  False  bar        1.6
1  2.2    2.8  2.1   4.62   True  bar        2.2
2  3.4    3.8  3.9  13.26   True  bar        NaN
3  3.5    NaN  NaN    NaN   True  bar        NaN

In [63]:
print("> Select column")
print(df["one"])
print("> Select row by index label")
print(df.loc[1])
print("> Select row by integer location")
print(df.iloc[1])
print("> Slice rows")
print(df[2:4])
print("> Select rows by boolean vector")
print(df[[True,True,True,False]])


> Select column
0    1.6
1    2.2
2    3.4
3    3.5
Name: one, dtype: float64
> Select row by index label
one           2.2
three         2.8
two           2.1
four         4.62
flag         True
foo           bar
one_trunc     2.2
Name: 1, dtype: object
> Select row by integer location
one           2.2
three         2.8
two           2.1
four         4.62
flag         True
foo           bar
one_trunc     2.2
Name: 1, dtype: object
> Slice rows
   one  three  two   four  flag  foo  one_trunc
2  3.4    3.8  3.9  13.26  True  bar        NaN
3  3.5    NaN  NaN    NaN  True  bar        NaN
> Select rows by boolean vector
   one  three  two   four   flag  foo  one_trunc
0  1.6    1.2  1.5   2.40  False  bar        1.6
1  2.2    2.8  2.1   4.62   True  bar        2.2
2  3.4    3.8  3.9  13.26   True  bar        NaN

Both loc and iloc accept a pair of indexers (row, column) to access individual cells or groups of cells. Slicing several rows and columns this way returns a DataFrame.


In [64]:
print("> Slice the whole DF")
print(df.loc[:,:])
print("> Slice one row")
print(df.loc[1,:])
print("> Slice one column")
print(df.loc[:,"one"])
print("> Slice two columns")
print(df.loc[:,["one","rand"]])
print("> Slice two columns and two rows")
print(df.loc[[1,2],["one","rand"]])

print("> Slice the whole DF")
print(df.iloc[:,:])
print("> Slice one row")
print(df.iloc[1,:])
print("> Slice one column")
print(df.iloc[:,1])
print("> Slice two columns")
print(df.iloc[:,[1,2]])
print("> Slice two columns and two rows")
print(df.iloc[[1,2],[1,2]])


> Slice the whole DF
   one  three  two   four   flag  foo  one_trunc
0  1.6    1.2  1.5   2.40  False  bar        1.6
1  2.2    2.8  2.1   4.62   True  bar        2.2
2  3.4    3.8  3.9  13.26   True  bar        NaN
3  3.5    NaN  NaN    NaN   True  bar        NaN
> Slice one row
one           2.2
three         2.8
two           2.1
four         4.62
flag         True
foo           bar
one_trunc     2.2
Name: 1, dtype: object
> Slice one column
0    1.6
1    2.2
2    3.4
3    3.5
Name: one, dtype: float64
> Slice two columns
   one  rand
0  1.6   NaN
1  2.2   NaN
2  3.4   NaN
3  3.5   NaN
> Slice two columns and two rows
   one  rand
1  2.2   NaN
2  3.4   NaN
> Slice the whole DF
   one  three  two   four   flag  foo  one_trunc
0  1.6    1.2  1.5   2.40  False  bar        1.6
1  2.2    2.8  2.1   4.62   True  bar        2.2
2  3.4    3.8  3.9  13.26   True  bar        NaN
3  3.5    NaN  NaN    NaN   True  bar        NaN
> Slice one row
one           2.2
three         2.8
two           2.1
four         4.62
flag         True
foo           bar
one_trunc     2.2
Name: 1, dtype: object
> Slice one column
0    1.2
1    2.8
2    3.8
3    NaN
Name: three, dtype: float64
> Slice two columns
   three  two
0    1.2  1.5
1    2.8  2.1
2    3.8  3.9
3    NaN  NaN
> Slice two columns and two rows
   three  two
1    2.8  2.1
2    3.8  3.9
FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike

Column selection, addition, deletion

We will go through these examples using a real dataset. Remember that in the Series examples we loaded a csv using the csv library; however, Pandas provides the tools to read a csv file straight into a DataFrame. Let's see an example using OpenFlights data. The data is structured as follows:

  • Airport ID Unique OpenFlights identifier for this airport.
  • Name Name of airport. May or may not contain the City name.
  • City Main city served by airport. May be spelled differently from Name.
  • Country Country or territory where airport is located.
  • IATA/FAA 3-letter FAA code, for airports located in Country "United States of America". 3-letter IATA code, for all other airports. Blank if not assigned.
  • ICAO 4-letter ICAO code. Blank if not assigned.
  • Latitude Decimal degrees, usually to six significant digits. Negative is South, positive is North.
  • Longitude Decimal degrees, usually to six significant digits. Negative is West, positive is East.
  • Altitude In feet.
  • Timezone Hours offset from UTC. Fractional hours are expressed as decimals, eg. India is 5.5.
  • DST Daylight savings time. One of E (Europe), A (US/Canada), S (South America), O (Australia), Z (New Zealand), N (None) or U (Unknown). See also: Help: Time
  • Tz database time zone Timezone in "tz" (Olson) format, eg. "America/Los_Angeles".

In [ ]:
import pandas as pd

url = 'https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat'
head = ["Airport ID", "Name", "City", "Country", "IATA/FAA", "ICAO", "Latitude", "Longitude", 
        "Altitude", "Timezone", "DST", "Tz database time zone"]
# read_csv accepts a URL directly, no urllib needed
data_frame = pd.read_csv(url, names=head)
data_frame.head()

In [ ]:
data_frame["Name"].head()

In [ ]:
(data_frame["Name"] + data_frame["ICAO"]).head()

In [ ]:
data_frame["Altitude (m)"] = (data_frame["Altitude"] * 0.3048)
data_frame["Seaside"] = data_frame["Altitude (m)"] < 20
data_frame[data_frame["Seaside"]].head()

In [ ]:
# Columns can be deleted or popped like with a dict:
del data_frame["Altitude"]
seaside = data_frame.pop('Seaside')
print(seaside.head())
data_frame.head()

In [ ]:
# When inserting a scalar value, it will naturally be propagated to fill the column:
data_frame["Infrastructure"] = "Airport"

Assigning New Columns in Method Chains

Inspired by dplyr’s mutate verb, DataFrame has an assign() method that allows you to easily create new columns that are potentially derived from existing columns.


In [ ]:
(data_frame.assign(hemisphere_north = lambda x: x['Latitude'].astype(float) > 0)).head(10)

Exercises

Exercise 1: Load iqsize.csv using the csv library. The result should be a DataFrame.

Exercise 2: Identify dataset variables. Resolve type problems. Can you change types?

Exercise 3: Check the range of the quantitative variables. Are they correct? If not, how would you correct them (don't spend too much time)?

Exercise 4: Check the labels of the qualitative variables. Are they correct? If not, how would you correct them?

Exercise 5: For quantitative variables, compute the mean and median.

Exercise 6: For qualitative variables, count how many observations of each label exist.

Exercise 7: Compute the statistics from Exercises 5 and 6, but now for each label of the sex variable.